An active learning approach to train a deep learning algorithm for tumor segmentation from brain MR images

Purpose
This study assesses the performance of active learning techniques for training a brain MRI glioma segmentation model.

Methods
The publicly available training dataset provided for the 2021 RSNA-ASNR-MICCAI Brain Tumor Segmentation (BraTS) Challenge was used in this study, consisting of 1251 multi-institutional, multi-parametric MR images. Post-contrast T1, T2, and T2 FLAIR images, together with the ground truth manual segmentations, were used as input for the model. The data were split into a training set of 1151 cases and a testing set of 100 cases, with the testing set held constant throughout. Deep convolutional neural network segmentation models were trained using the NiftyNet platform. To test the viability of active learning in training a segmentation model, an initial reference model was trained using all 1151 training cases, followed by two additional models using only 575 cases and 100 cases. The segmentations predicted by these two additional models on the remaining training cases were then added to the training dataset for additional training.

Results
An active learning approach led to comparable model performance for segmentation of brain gliomas (Dice score of 0.906 for the reference model vs 0.868 with active learning) while requiring manual annotation for only 28.6% of the data.

Conclusion
The active learning approach, when applied to model training, can drastically reduce the time and labor spent on preparing ground truth training data.

Critical relevance statement
Active learning concepts were applied to deep learning-assisted segmentation of brain gliomas from MR images to assess their viability in reducing the amount of manually annotated ground truth data required for model training.

Key points
• This study focuses on assessing the performance of active learning techniques to train a brain MRI glioma segmentation model.
• The active learning approach for manual segmentation can lead to comparable model performance for segmentation of brain gliomas.
• Active learning when applied to model training can drastically reduce the time and labor spent on preparation of ground truth training data.

Supplementary Information
The online version contains supplementary material available at 10.1186/s13244-023-01487-6.

was computed for each iteration in each model. The "best" iteration for a given model was taken to be the iteration with the maximum average Dice score, and the remaining metrics were reported for that iteration.
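The selection rule just described can be illustrated with a minimal sketch. The function name `best_iteration`, the dictionary layout, and the numbers are all hypothetical and are not taken from the study; the sketch only shows the idea of picking the checkpoint with the highest mean Dice score.

```python
from statistics import mean


def best_iteration(dice_by_iteration):
    """Return (iteration, mean Dice) for the iteration with the highest mean Dice.

    dice_by_iteration maps an iteration number to the list of per-case Dice
    scores obtained at that iteration (hypothetical layout).
    """
    averages = {it: mean(scores) for it, scores in dice_by_iteration.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]


# Illustrative values only, not results from the study.
example = {
    10000: [0.80, 0.84, 0.79],
    20000: [0.88, 0.90, 0.86],
    30000: [0.87, 0.89, 0.85],
}
iteration, average = best_iteration(example)
print(iteration, round(average, 3))
```

Once the best iteration is known, the other metrics would be read off for that same iteration rather than maximized independently.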

C. Quantitative evaluation
Each voxel in the image can then be classified as either true positive (TP), true negative (TN), false positive (FP), or false negative (FN). Sensitivity is defined as the ratio of TP to the sum of TP and FN (equation 1). In other words, sensitivity describes the ability of the model to correctly identify voxels belonging to the glioma. Sensitivity is often accompanied by specificity; however, because specificity depends on the volume of the glioma, which varies greatly from patient to patient, it does not convey useful information here [1]. An alternative to specificity is the positive predictive value (PPV), defined as the ratio of TP to the sum of TP and FP (equation 2). PPV describes the proportion of voxels predicted as glioma that truly belong to the glioma.
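As a minimal sketch of these two definitions, the following assumes that the predicted and ground truth segmentations are available as flattened sequences of 0/1 voxel labels. The function names and the toy masks are illustrative assumptions; returning NaN for an empty denominator is also an assumption, not something stated in the study.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN over two equally long sequences of 0/1 voxel labels."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    tp = tn = fp = fn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """TP / (TP + FN); NaN when the ground truth contains no glioma voxels."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def ppv(tp, fp):
    """TP / (TP + FP); NaN when no voxel was predicted as glioma."""
    return tp / (tp + fp) if (tp + fp) else float("nan")


# Toy masks: 4 true glioma voxels, 4 predicted, 3 of them overlapping.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(pred, truth)
print(sensitivity(tp, fn), ppv(tp, fp))
```

Note that TN does not enter either formula, which is why both metrics are unaffected by how much background surrounds the tumor.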
The Dice score is defined as twice the number of voxels shared by the predicted segmentation and the ground truth, divided by the sum of the numbers of voxels in each (equation 3). This value ranges between 0 and 1 and measures the overlap between the predicted and ground truth segmentations. A value closer to 1 indicates more overlap and is therefore preferred. The Jaccard Similarity Coefficient is defined as the size of the intersection between the segmentation and the ground truth divided by the size of their union (equation 4). Like the Dice score, the Jaccard Similarity Coefficient ranges from 0 to 1 and describes the similarity between the two sets of segmentation voxels, with values closer to 1 being preferred.
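The two overlap measures can be sketched as follows, again assuming flattened 0/1 voxel masks. The function names and toy masks are illustrative, and returning 1.0 when both masks are empty is a convention chosen for this sketch rather than something specified in the study.

```python
def dice_score(pred, truth):
    """2 * |P ∩ T| / (|P| + |T|) for two equally long 0/1 voxel sequences."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumption: two empty masks are treated as a perfect match.
    return 2 * intersection / total if total else 1.0


def jaccard(pred, truth):
    """|P ∩ T| / |P ∪ T| for two equally long 0/1 voxel sequences."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


truth_mask = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_mask = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
d = dice_score(pred_mask, truth_mask)
j = jaccard(pred_mask, truth_mask)
print(d, j)
```

The two values are linked by J = D / (2 - D), so they always rank segmentations in the same order; the Jaccard value is simply the more pessimistic of the two for any imperfect overlap.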
Hausdorff Distance is defined as the maximum of the minimum distances between two sets of points (A and B) in space. It is shown in equation 5, in which the elements of set A are written as $a_i$ and the elements of set B as $b_j$. The distance between voxels $a_i$ and $b_j$ is denoted $\delta(a_i, b_j)$ and is the Euclidean distance between the center of $a_i$ and the center of $b_j$ [1]. Hausdorff Distance describes the distance between two sets of voxels, so a smaller value is preferred. Because it is driven by a single maximum, Hausdorff Distance is very sensitive to noise [1]. This issue can be addressed using a Modified Hausdorff Distance, which replaces the maximum distance with the average distance [2]. The Modified Hausdorff Distance is given in equation 6.
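A minimal sketch of both distances is given below. It assumes each set is a list of voxel-center coordinates (tuples in a common unit) and that both distances are symmetrized by taking the larger of the two directed values, which is the usual convention but is an assumption here because equations 5 and 6 are not reproduced in this section. All function names are illustrative.

```python
from math import dist
from statistics import mean


def min_distances(a_points, b_points):
    """For every point in A, the Euclidean distance to its nearest point in B."""
    return [min(dist(a, b) for b in b_points) for a in a_points]


def hausdorff(a_points, b_points):
    """Largest nearest-neighbor distance, taken over both directions."""
    return max(
        max(min_distances(a_points, b_points)),
        max(min_distances(b_points, a_points)),
    )


def modified_hausdorff(a_points, b_points):
    """Same construction with the inner maximum replaced by the average."""
    return max(
        mean(min_distances(a_points, b_points)),
        mean(min_distances(b_points, a_points)),
    )


# Toy point sets: one outlying point in B dominates the classic distance.
set_a = [(0, 0, 0), (1, 0, 0)]
set_b = [(0, 0, 0), (4, 0, 0)]
print(hausdorff(set_a, set_b), modified_hausdorff(set_a, set_b))
```

In the toy example the single distant point at (4, 0, 0) sets the Hausdorff Distance on its own, whereas averaging halves its influence, which illustrates the noise sensitivity discussed above. This brute-force version compares every pair of points and raises an error for empty sets, so it is meant for illustration rather than for full-resolution volumes.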
The results of the classification model were evaluated using the sensitivity, specificity, PPV, F-score, and area under the receiver operating characteristic curve (AUC). The F-score ranges from 0 to 1 and is calculated as the harmonic mean of the precision (also called PPV) and the recall (also called sensitivity) [3]. The F-score is given in equation 7. The AUC represents the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance [4]. Although the AUC ranges from 0 to 1, purely random guessing yields a diagonal ROC curve with an AUC of 0.5, so the practical range for the AUC is 0.5 to 1.

Figure 3. Example input images for the Dice score predictor, from left to right: T1c, T2, FLAIR, and the predicted segmentation probability map.
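The F-score and AUC definitions used for the classification model can be sketched as follows. The AUC is computed directly from its ranking interpretation by comparing every positive score with every negative score, with ties counted as one half (a common convention, assumed here). The function names and example scores are illustrative and are not outputs of the study's classifier.

```python
def f_score(precision, recall):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def auc_pairwise(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Illustrative classifier scores only.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3]
print(round(f_score(0.8, 0.5), 4), round(auc_pairwise(positives, negatives), 4))
```

With identical scores for every instance, every pair is a tie and the function returns 0.5, matching the chance level described above.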

Supplementary Figure 4. Confusion matrix for the classification of predicted segmentations into "Poor Quality", "Acceptable with Adjustments", and "Acceptable Quality" (axes: True Label and Predicted Label).

Discussion
In this study, the application of an active learning approach to segment whole brain gliomas from MRI was assessed. The key benefit of active learning lies in its potential to reduce data requirements, with the most useful data being selected for model training through feedback from the model. After three baseline segmentation models were trained as reference, active learning was applied to the two models trained on reduced datasets using a Dice score threshold, and the training sets were updated based on the queried data. While this first step allowed the viability of active learning for training glioma segmentation models to be assessed as a concept, it relied on prior knowledge of the ground truth for the unseen cases in order to compute Dice scores. In a clinical or real-world setting this would not be practical, as one would want to use all available training data accompanied by a ground truth segmentation to train the best model possible.

A secondary Dice score predictor was therefore developed to address this challenge, with the goal of classifying predicted segmentations as "Poor Quality", "Acceptable with Adjustments", or "Acceptable Quality" using Dice score thresholds of below 0.6, between 0.6 and 0.8, and above 0.8, respectively. Because using the Dice score to select cases for active learning is not feasible in a real-world setting that lacks ground truth data, it was also important to demonstrate that the classification model could be applied within the active learning process itself rather than only as a concept. For this evaluation, it was compared with a model that used the true Dice scores. Both models showed a drastic reduction in the number of cases requiring manual ground truth segmentation, with the reference Dice method model requiring just 46% of the ground truth images and the classification method model requiring just 43%.
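The three quality classes can be illustrated with a minimal sketch of the thresholding step. The thresholds come from the text; the treatment of scores exactly equal to 0.6 or 0.8 (assigned here to the middle class), the function names, and the case identifiers are assumptions made for illustration. In the study itself the classes are predicted by a classifier from the images, not derived from a known Dice score.

```python
POOR = "Poor Quality"
ADJUST = "Acceptable with Adjustments"
ACCEPT = "Acceptable Quality"


def quality_class(dice, low=0.6, high=0.8):
    """Map a Dice score to one of the three quality classes.

    Assumption: scores exactly at a threshold fall into the middle class.
    """
    if dice < low:
        return POOR
    if dice <= high:
        return ADJUST
    return ACCEPT


def group_by_quality(dice_by_case):
    """Group case identifiers by the quality class of their Dice score."""
    groups = {POOR: [], ADJUST: [], ACCEPT: []}
    for case_id, dice in dice_by_case.items():
        groups[quality_class(dice)].append(case_id)
    return groups


# Hypothetical case identifiers and scores.
scores = {"case_001": 0.91, "case_002": 0.55, "case_003": 0.72}
for label, cases in group_by_quality(scores).items():
    print(label, cases)
```

How each group is subsequently handled in an active learning round (manual segmentation, correction, or direct reuse of the prediction) is a separate design decision that this sketch deliberately leaves out.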
The similarity in the reduction of ground truth cases was mirrored by a similarity in segmentation performance, with the Dice method model achieving a Dice score of 0.885 and the classification method model achieving a Dice score of 0.860. These comparable results between the reference method, which required manual ground truth data, and the Dice score predictor method, which is more representative of a real-world situation, demonstrate that active learning can be a viable technique in real-world situations rather than only in proof-of-concept settings.

The quantitative analysis of the segmentation models demonstrated that an active learning approach applied to glioma segmentation from MR images yields segmentation results comparable to those of reference non-active learning models, but at a lower ground truth cost. With active learning, the average Dice score of the predicted segmentations of T100 rose from 0.865 to 0.870 for Model B and from 0.825 to 0.868 for Model C. While these two models did not quite reach the Dice score of the reference Model A (0.906), their Dice scores were still comparably high and were obtained with much less manual segmentation for training. For Model B, only 127 of the additional 576 cases required manual segmentation, for a total of 702 of the 1151 cases. This reduced the number of cases needing an expert's manual segmentation by 449, or 39.0% of the total training dataset. For Model C, across all 3 rounds of active learning, only 229 of the additional 1051 cases required manual segmentation. This reduced the number of training cases needing expert manual segmentation by 822, meaning that only 329 (28.6%) of the 1151 cases required manual segmentation. Reductions of this size would substantially lower the time and labor required of trained experts.
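The annotation counts reported for Models B and C can be checked with a short worked calculation. The case counts are taken from the text; the function name `annotation_summary` and the layout of its result are illustrative.

```python
def annotation_summary(total_cases, initial_manual, queried_manual):
    """Summarize how many cases needed manual segmentation and how many were saved."""
    manual = initial_manual + queried_manual
    saved = total_cases - manual
    return {
        "manual": manual,
        "saved": saved,
        "manual_pct": round(100 * manual / total_cases, 1),
        "saved_pct": round(100 * saved / total_cases, 1),
    }


# Model B: 575 initial cases plus 127 queried cases out of 1151.
model_b = annotation_summary(1151, 575, 127)
# Model C: 100 initial cases plus 229 queried cases out of 1151.
model_c = annotation_summary(1151, 100, 229)
print(model_b)
print(model_c)
```

For Model B this gives 702 manually segmented cases (61.0%) and 449 saved (39.0%); for Model C it gives 329 manually segmented cases (28.6%) and 822 saved (71.4%).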
Although the segmentations obtained through active learning did not quite reach the level of the reference model, the reduction in manually segmented ground truth data may justify this trade-off. This may be especially true for tasks that tolerate lower precision, where a slight decrease in segmentation accuracy matters less than the time and effort saved.
Though not for glioma segmentation specifically, various other studies have applied active learning techniques to medical image segmentation with the aim of reducing manual segmentation requirements. In a study applying active learning to interactive 3D image segmentation [5], a technique involving uncertainty fields based on boundary, regional, smoothness, and entropy terms was tested on various segmentation tasks, including the putamen in brain MRI, the liver in abdominal CT, and pelvic bones and muscles in both CT and MRI. The study found that, in addition to yielding comparable or improved Dice scores, the active learning techniques reduced user input by an average of 64%. This reduction in human segmentation effort is similar to that of the present study, in which manual segmentation was reduced by 39.0% for Model B and 71.4% for Model C. Another study, which generated realistic chest x-ray images using a conditional generative adversarial network and then used a Bayesian neural network to calculate informativeness for active learning [6], similarly found that an active learning framework achieved comparable results using only 35% of the full dataset. In a study of hippocampal segmentation from MR images [7], a Query-by-Committee approach to active learning achieved full segmentation accuracy using only 23% of the dataset. While these studies all show drastic reductions in data requirements, each uses a different approach to applying active learning concepts. This suggests two things. First, many different active learning techniques can achieve similarly small data requirements. Second, with multiple possible techniques, there may be one that works best for a given task, so future studies wishing to optimize the process may need to test multiple approaches.